Abstract

We consider the problem of restructuring compressed texts without explicit decompression.
We present algorithms which allow conversions from compressed representations of a string
T produced by any grammar-based compression algorithm, to representations produced by
several specific compression algorithms including LZ77, LZ78, run length encoding, and some
grammar-based compression algorithms. These are the first algorithms that achieve running
times polynomial in the size of the compressed input and output representations of T. Since
most of the representations we consider can achieve exponential compression, our algorithms
are theoretically faster in the worst case than any algorithm which first decompresses the string
for the conversion.

1 Introduction

Data compression is an indispensable technology for the handling of large scale data available 
today. The traditional objective of compression has been to save storage and communication costs, 
whereas actually using the data normally requires a decompression step which can require enormous 
computational resources. However, recent advances in compressed string processing algorithms give 
us an intriguing new perspective in which compression can be regarded as a form of pre-processing 
which not only reduces space requirements for storage, but allows efficient processing of the strings, 
including compressed pattern matching pH |40l [101 E] > string indices [33l [71 [22] , edit distance and 
its variants [HI [151 [3H] , and various other applications [121 IMl El 1301 [2] . These methods assume a 
compressed representation of the text as input, and process them without explicit decompression. 
An interesting property of these methods is that they can be theoretically - and sometimes even 
practically - faster than algorithms which work on an uncompressed representation of the same 
data. 

The main focus of this paper is to develop a framework in which various processing on strings can 
be conducted entirely in the world of compressed representations. A primary tool for this objective 
is restructuring, or conversion, of the compressed representation. Key results for this problem were 
obtained independently by Rytter [36] and Charikar et al. [5]: given a non-self-referential
LZ77-encoding of size n that represents a string of length N, they gave algorithms for constructing a
balanced grammar of size at most $O(n \log(N/n))$ in output linear time. The size of the resulting
grammar is an $O(\log(N/g))$ approximation of the smallest grammar, whose size is g. Grammars are
generally easier to handle than the LZ-encodings, for example, in compressed pattern matching [11],
and this result is the motivational backbone of many efficient algorithms on grammar compressed 
strings. 

Our Results: In this paper, we present a comprehensive collection of new algorithms for restruc- 
turing to and from compressed texts represented in terms of run length encoding (RLE), LZ77 and 
LZ78 encodings, grammar based compressor RE-PAIR and BISECTION, edit sensitive parsing 
(ESP), straight line programs (SLPs), and admissible grammars. All algorithms achieve running 
times polynomial in the size of the compressed input and output representations of the string. Since 
(most of) the representations we consider can achieve exponential compression, our algorithms are 
theoretically faster in the worst case, than any algorithm which first decompresses the string for 
the conversion. Figure [1] summarizes our results. Our algorithms immediately allow the following 
applications to be solvable in polynomial time in the compressed world: 

Dynamic compressed texts: Although data structures for dynamic compressed texts have been 
studied somewhat in the literature [3l [131 U [271 [Ml [23] , grammar based or LZ77 compression have 
not been considered in this perspective. It has recently been argued that for highly repetitive strings, 
grammar based compression and LZ77 compression algorithms are better suited and achieve better 
compression [7, 22].

Modification of the grammar corresponding to edit operations on the string can be conducted
in $O(h)$ time, where h is the height of the grammar. (Note that when the grammar is balanced,
$h = O(\log N)$ even in the worst case.) However, these modifications are ad-hoc, and do not assure
that the resulting grammar is a good compressed representation of the string, and repeated edit 
operations will inevitably cause degradation on the compression ratio. Note that previous work 
of Rytter and Charikar et al. are not sufficient in this respect: their algorithms can balance an 
arbitrary grammar, but they must be given an LZ-encoding of the modified string in order for the 
grammar to be small. 

Post-selection of compression format: Some methods in the field of data mining and machine 
learning utilize compression as a means of detecting and extracting meaningful information from 

Figure 1: Summary of transformations between compressed representations. The label of each arc
shows the time complexity of each transformation, where n and m are respectively the input and
output sizes of each transformation, and N is the length of the uncompressed string. The broken
arcs mean naive $O(n)$-time transformations. Complexities without references are results shown in
this paper.

string data [U [6] . Compression of a given string is achieved by exploiting various regularities con- 
tained in the string, and since different compression algorithms capture different regularities, the 
usefulness of a specific representation will vary depending on the application. As it is impossi- 
ble to predetermine the best compression algorithm for all future applications, conversion of the 
representation is an essential task. 

For example, the normalized compression distance (NCD) [6] between two strings X and Y
with respect to compression algorithm A is defined by the values $C_A(XY)$, $C_A(X)$, and $C_A(Y)$,
which respectively denote the sizes of the compressed representations of strings XY, X, and Y
when compressed by algorithm A. Restructuring enables us to solve, in the compressed world, the 
problem of calculating the NCD with respect to some compression algorithm, given strings which 
were compressed previously by a (possibly) different compression algorithm. 

2 Preliminaries 



2.1 Notations 

Let $\Sigma$ be a finite alphabet. An element of $\Sigma^*$ is called a string. The length of a string S is denoted
by $|S|$. The empty string $\varepsilon$ is a string of length 0, namely, $|\varepsilon| = 0$. For a string S = XYZ, X, Y
and Z are called a prefix, substring, and suffix of S, respectively. The set of all substrings of a string
S is denoted by Substr(S). The i-th character of a string S is denoted by S[i] for $1 \le i \le |S|$, and
the substring of a string S that begins at position i and ends at position j is denoted by S[i : j] for
$1 \le i \le j \le |S|$. For convenience, let $S[i : j] = \varepsilon$ if j < i. For any strings S and P, let Occ(S, P)
be the set of occurrences of P in S, i.e., $Occ(S, P) = \{k > 0 \mid S[k : k + |P| - 1] = P\}$.



We shall assume that the computer word size is at least $\log |S|$, and hence, values representing
lengths and positions of S in our algorithms can be manipulated in constant time.

2.2 Suffix Arrays and LCP Arrays

The suffix array SA [28] of any string S is an array of length $|S|$ such that SA[i] = j, where $S[j : |S|]$
is the i-th lexicographically smallest suffix of S. Let $lcp(S_1, S_2)$ be the length of the longest common
prefix of $S_1$ and $S_2$. The lcp array of any string S is an array of length $|S|$ such that LCP[i] is
$lcp(S[SA[i-1] : |S|], S[SA[i] : |S|])$ for $2 \le i \le |S|$, and LCP[1] = 0. The suffix array for any string
S can be constructed in $O(|S|)$ time (e.g. [17]) assuming an integer alphabet. Given the string and
suffix array, the lcp array can also be calculated in $O(|S|)$ time [19].

2.3 Run Length Encoding 

Definition 1 The Run-Length (RL) factorization of a string S is the factorization $f_1, \ldots, f_n$ of
S such that for every $i = 1, \ldots, n$, factor $f_i$ is the longest prefix of $f_i \cdots f_n$ with $f_i \in F$, where
$F = \bigcup_{a \in \Sigma} \{a^p \mid p > 0\}$.

We note that each factor $f_i$ can be written as $f_i = a_i^{p_i}$ for some symbol $a_i \in \Sigma$ and some integer
$p_i > 0$, and the repeating symbols $a_i$ and $a_{i+1}$ of consecutive factors $f_i$ and $f_{i+1}$ are different. The
output of RLE is a sequence of pairs of symbol $a_i$ and integer $p_i$. The number of distinct bigrams
occurring in S is at most $2n - 1$, since these are $\{a_i a_i \mid 1 \le i \le n\} \cup \{a_i a_{i+1} \mid 1 \le i < n\}$.
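
As a small illustration of Definition 1, the following Python sketch computes the RL factorization of an uncompressed string and checks the $2n - 1$ bound on distinct bigrams for one input. The function names are chosen here for illustration only and do not come from the paper.

```python
from itertools import groupby


def rl_factorization(s):
    """Return the RL factorization of s as a list of (symbol, exponent) pairs."""
    return [(ch, len(list(run))) for ch, run in groupby(s)]


def distinct_bigrams(s):
    """Return the set of distinct length-2 substrings of s."""
    return {s[i:i + 2] for i in range(len(s) - 1)}


rle = rl_factorization("aaabccaa")
# n = 4 runs, so at most 2 * 4 - 1 = 7 distinct bigrams can occur.
bigrams = distinct_bigrams("aaabccaa")
```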

2.4 LZ Encodings 

LZ encodings are dynamic dictionary based encodings. There are two main variants for LZ encod- 
ings, LZ78 and LZ77. 

The LZ78 encoding [32] has several variants. One of the most popular variants is the LZW
encoding [39], which is based on the LZ78 factorization defined below.

Definition 2 (LZ78 factorization) The LZ78-factorization of a string S is the factorization
$f_1, \ldots, f_n$ of S where for every $i = 1, \ldots, n$, factor $f_i$ is the longest prefix of $f_i \cdots f_n$ with $f_i \in F_i$,
where $F_i$ is defined by $F_1 = \Sigma$, and $F_{i+1} = F_i \cup \{f_i f_{i+1}[1]\}$.

The output is the sequence of IDs of factors fi in Fi. We note that Fi can be recovered from this 
sequence and thus is not included in the output. 
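
The following Python sketch follows Definition 2 literally on an uncompressed string. It is only a reference implementation for small inputs, under two assumptions made here: $\Sigma$ is taken to be the set of symbols occurring in the input, and the greedy extension relies on each $F_i$ being prefix-closed (which holds because every added string extends an existing member by one symbol).

```python
def lz78_factorization(s):
    """Return the LZ78 factors of s according to Definition 2."""
    dictionary = set(s)  # F_1 = Sigma, assumed to be the symbols occurring in s
    factors = []
    pos = 0
    while pos < len(s):
        length = 1
        # F_i is prefix-closed, so extending one symbol at a time
        # finds the longest prefix of the remaining text that is in F_i.
        while pos + length < len(s) and s[pos:pos + length + 1] in dictionary:
            length += 1
        factor = s[pos:pos + length]
        factors.append(factor)
        pos += length
        if pos < len(s):
            dictionary.add(factor + s[pos])  # add f_i f_{i+1}[1]
    return factors
```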

The LZ77 encoding [H] also has many variants. The LZSS encoding [37] is based on the 
LZ77 factorization below. The LZ77 factorization has two variations depending upon whether
self-references are allowed.

Definition 3 (LZ77 factorization w/o self-references) The LZ77-factorization without
self-references of a string S is the factorization $f_1, \ldots, f_n$ of S such that for every $i = 1, \ldots, n$, factor
$f_i$ is the longest prefix of $f_i \cdots f_n$ with $f_i \in F_i$, where $F_i = Substr(f_1 \cdots f_{i-1}) \cup \Sigma$.

Definition 4 (LZ77 factorization w/ self-references) The LZ77-factorization with self-references
of a string S is the factorization $f_1, \ldots, f_n$ of S such that for every $i = 1, \ldots, n$, factor $f_i$ is the
longest prefix of $f_i \cdots f_n$ with $f_i \in F_i$, where $F_i = Substr(f_1 \cdots f_{i-1} f_i') \cup \Sigma$, where $f_i'$ is the prefix of
$f_i$ obtained by removing the last symbol.

The LZSS is based on the LZ77 with self-references and its output is a sequence of pointers to
factors $f_i$.
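
Definitions 3 and 4 differ only in the text that a factor may copy from. The naive Python sketch below makes this explicit on an uncompressed string; it is a quadratic-or-worse reference for small inputs, not one of the conversion algorithms of this paper, and the function name and flag are chosen here for illustration.

```python
def lz77_factorization(s, self_references=False):
    """Return the LZ77 factors of s per Definition 3 (or Definition 4 if self_references)."""
    factors = []
    pos = 0
    while pos < len(s):
        best = 1  # a single symbol is always allowed (the "union Sigma" part)
        for length in range(len(s) - pos, 1, -1):
            candidate = s[pos:pos + length]
            # Without self-references the source is f_1 ... f_{i-1};
            # with self-references it is extended by f_i minus its last symbol.
            end = pos + length - 1 if self_references else pos
            if candidate in s[:end]:
                best = length
                break
        factors.append(s[pos:pos + best])
        pos += best
    return factors
```

On the string abababab the two variants give a, b, ab, abab and a, b, ababab respectively, which shows how a self-referencing factor can overlap its own source.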



2.5 Grammar-based compression methods 

An admissible grammar [20] is a context-free grammar that generates a single string.

2.5.1 Re-pair 

Starting with $w_1 = S$, we repeat the following until no bigram occurs more than once in $w_i$: we
find a most frequent bigram $\gamma_i$ in the string $w_i$, and then replace every non-overlapping occurrence
of $\gamma_i$ in $w_i$ with a new variable $X_i$ to obtain string $w_{i+1}$. Let r be the number of iterations. The
resulting grammar has the production rules $\{X_i \to \gamma_i\}_{i=1}^{r} \cup \{X_{r+1} \to w_{r+1}\}$.
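
The loop described above can be sketched in Python as follows. This is a naive quadratic-time version working on the uncompressed string, with assumptions made here for illustration: variables are represented by positive integers, ties between equally frequent bigrams are broken by first occurrence, and occurrences are counted and replaced greedily from the left.

```python
from collections import Counter


def count_nonoverlapping(w):
    """Count left-to-right non-overlapping occurrences of every bigram in w."""
    counts, last = Counter(), {}
    for i in range(len(w) - 1):
        bigram = (w[i], w[i + 1])
        if last.get(bigram) == i - 1:  # overlaps the occurrence just counted
            continue
        counts[bigram] += 1
        last[bigram] = i
    return counts


def re_pair(s):
    """Return (rules, start) where rules maps integer variables to right-hand sides."""
    w, rules, next_id = list(s), {}, 1
    while True:
        counts = count_nonoverlapping(w)
        if not counts:
            break
        bigram, freq = counts.most_common(1)[0]
        if freq < 2:
            break
        out, i = [], 0
        while i < len(w):
            if i + 1 < len(w) and (w[i], w[i + 1]) == bigram:
                out.append(next_id)
                i += 2
            else:
                out.append(w[i])
                i += 1
        rules[next_id] = bigram
        w, next_id = out, next_id + 1
    rules[next_id] = tuple(w)  # the final rule X_{r+1} -> w_{r+1}
    return rules, next_id


def expand(rules, sym):
    """Decompress a symbol; terminals are str, variables are int."""
    if isinstance(sym, str):
        return sym
    return "".join(expand(rules, x) for x in rules[sym])
```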

Theorem 5 ([5]) For any string S of length N, Re-pair constructs in $O(N)$ time an admissible
grammar of size $O(g (N/\log N)^{2/3})$, where g is the size of the smallest grammar that derives S.

2.5.2 Bisection 

The Bisection algorithm [20, 21] constructs a grammar that can be described recursively as follows:
the variable representing string S ($|S| \ge 2$) is derived by the rule $X \to YZ$, with $|Y| = 2^k$ and
$|Z| = |X| - 2^k$, where k is the largest integer s.t. $2^k < |X|$. The production rules for $S[1 : 2^k]$
and $S[2^k + 1 : |S|]$ are defined recursively. Whenever $S[i : i + q - 1] = S[j : j + q - 1]$ for some
$i, j, q \ge 1$ which appear in the above construction, the same variable is to be used for deriving these
substrings.
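
A direct transcription of this recursive description is sketched below in Python. It operates on the uncompressed string and identifies equal substrings through a dictionary keyed by the substring itself, so it only illustrates the shape of the output grammar; the integer variable names and helper functions are assumptions made for this example.

```python
def bisection(s):
    """Return (rules, start) for the Bisection grammar of a non-empty string s."""
    rules = {}  # variable id -> terminal symbol or (left id, right id)
    memo = {}   # substring -> variable id, so equal substrings share a variable

    def build(sub):
        if sub in memo:
            return memo[sub]
        if len(sub) == 1:
            rhs = sub
        else:
            k = (len(sub) - 1).bit_length() - 1  # largest k with 2**k < len(sub)
            rhs = (build(sub[:2 ** k]), build(sub[2 ** k:]))
        var = len(rules) + 1
        rules[var] = rhs
        memo[sub] = var
        return var

    start = build(s)
    return rules, start


def expand_bisection(rules, var):
    """Decompress the string derived by variable var."""
    rhs = rules[var]
    if isinstance(rhs, str):
        return rhs
    return expand_bisection(rules, rhs[0]) + expand_bisection(rules, rhs[1])
```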

Theorem 6 ([5]) For any string S of length N, Bisection constructs an admissible grammar of
size $O(g (N/\log N)^{1/2})$, where g is the size of the smallest grammar that derives S.

2.6 Edit-sensitive parsing (ESP) 

A string $a^k$ ($k \ge 2$) is called a repetition of symbol a, and $a^+$ is its abbreviation. We let $\log^{(1)} n = \log n$,
$\log^{(i+1)} n = \log \log^{(i)} n$, and $\log^* n = \min\{i \mid \log^{(i)} n \le 1\}$. For example, $\log^* n \le 5$ for any
$n \le 2^{65536}$. We thus treat $\log^* n$ as a constant for sufficiently large n.
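
The iterated logarithm can be evaluated directly from its definition. The Python sketch below assumes logarithms to base 2 (the base under which the $2^{65536}$ example holds) and an integer argument $n \ge 1$; the function name is chosen here.

```python
import math


def log_star(n):
    """Return the least i >= 1 such that the i-fold iterated log2 of n is at most 1."""
    x = math.log2(n)
    count = 1
    while x > 1:
        x = math.log2(x)
        count += 1
    return count
```

For instance, starting from $2^{65536}$ the successive values are 65536, 16, 4, 2, 1, so five applications are needed.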

We assume that any context-free grammar G is admissible, i.e., G derives just one string and
for each variable X, exactly one production rule $X \to \alpha$ exists. The set of variables is denoted by
V(G), and the set of production rules, called the dictionary, is denoted by D(G). We also assume that
$X \to \alpha \in D(G)$ and $Y \to \alpha \in D(G)$ implies X = Y because one of them is unnecessary. We use
V and D instead of V(G) and D(G) when G is clear from the context. The string derived by D from a string
$S \in (\Sigma \cup V)^*$ is denoted by S(D). For example, when S = aYY and $D = \{X \to bc, Y \to Xa\}$, we
obtain S(D) = abcabca.
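
The derivation S(D) in this example can be reproduced with a few lines of Python. The sketch assumes that every symbol is a single character and that the dictionary is given as a mapping from variables to right-hand sides; both are simplifications made here.

```python
def derive(s, dictionary):
    """Return S(D): expand every variable occurring in s until only terminals remain."""
    return "".join(
        derive(dictionary[sym], dictionary) if sym in dictionary else sym
        for sym in s
    )


D = {"X": "bc", "Y": "Xa"}
derived = derive("aYY", D)  # the example from the text
```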

Any string is uniquely partitioned into $w_1 a_1^+ w_2 a_2^+ \cdots w_k a_k^+ w_{k+1}$ by maximal repetitions,
where each $a_i$ is a symbol and $w_i$ is a string containing no repetition. Each $a_i^+$ is called a Type1
metablock, $w_i$ is called a Type2 metablock if $|w_i| \ge \log^* n$, and any other short $w_i$ is called a Type3
metablock, where if $|w_i| = 1$, this is attached to $a_{i-1}^+$ or $a_i^+$, with preference $a_{i-1}^+$ when both are
possible. Thus, any metablock has length at least two.

Let S be a metablock and D be a current dictionary starting with $D = \emptyset$. We set $ESP(S, D) = (S', D \cup D')$
for $S'(D') = S$ and $S'$ described as follows:

1. When S is Type1 or Type3 of length $k \ge 2$,

(a) If k is even, let $S' = t_1 t_2 \cdots t_{k/2}$, and make $t_i \to S[2i - 1 : 2i] \in D'$.

(b) If k is odd, let $S' = t_1 t_2 \cdots t_{(k-3)/2}\, t$, and make $t_i \to S[2i - 1 : 2i] \in D'$ and
$t \to S[k - 2 : k] \in D'$, where $t_0$ denotes the empty string for k = 3.

2. When S is Type2,

(c) for the partitioned $S = s_1 s_2 \cdots s_k$ ($2 \le |s_i| \le 3$) by alphabet reduction, let $S' = t_1 t_2 \cdots t_k$,
and make $t_i \to s_i \in D'$.

Figure 2: Parsing for Type1 string: Line (1) is an original Type1 string $S = a^9$ with its position
blocks. Line (2) is the resulting string AAAB, and the production rules $A \to aa$ and $B \to aaa$.
Any Type3 string is parsed analogously.

Figure 3: Parsing for Type2 string: Line (1) is an original Type2 string 'adeghecadeg' with its
position blocks by alphabet reduction, whose definition is omitted in this paper. Line (2) is the
resulting string ABCDB, and the production rules $A \to ad$, $B \to eg$, etc.

Cases (a) and (b) denote a typical left aligned parsing. For example, in case $S = a^6$, $S' = x^3$
and $x \to a^2 \in D'$, and in case $S = a^9$, $S' = x^3 y$ and $x \to a^2, y \to aaa \in D'$. In Case (c), we omit
the description of alphabet reduction [9] because the details are unnecessary in this paper.

Case (b) is illustrated in Fig. 2 for a Type1 string, and the parsing manner in Case (a) is
obtained by ignoring the last three symbols in Case (b). Parsing for Type3 is analogous. Case (c)
is illustrated in Fig. 3.
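
The left aligned parsing of Cases (a) and (b) is simple enough to state as code. The Python sketch below only splits a Type1 or Type3 metablock into its position blocks (pairs, plus one final triple when the length is odd); the assignment of variables to blocks and the alphabet reduction of Case (c) are not modelled, and the function name is an assumption of this example.

```python
def parse_left_aligned(block):
    """Split a metablock of length k >= 2 into pairs, ending with a triple if k is odd."""
    k = len(block)
    if k < 2:
        raise ValueError("a metablock has length at least two")
    if k % 2 == 0:
        return [block[i:i + 2] for i in range(0, k, 2)]
    pairs = [block[i:i + 2] for i in range(0, k - 3, 2)]
    return pairs + [block[k - 3:]]  # the last three symbols form one block
```

Applied to a run of nine copies of a, this yields three blocks aa followed by one block aaa, matching the string AAAB with $A \to aa$ and $B \to aaa$.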

Finally, we define ESP for any string $S \in (\Sigma \cup V)^*$ that is partitioned into $S_1 S_2 \cdots S_k$ by k
metablocks; $ESP(S, D) = (S', D \cup D') = (S_1' \cdots S_k', D \cup D')$, where $D'$ and each $S_i'$ satisfying
$S_i'(D') = S_i$ are defined as above.

Iteration of ESP is defined by $ESP^i(S, D) = ESP^{i-1}(ESP(S, D))$. In particular, $ESP^*(S, D)$
denotes the iterations of ESP until $|S| = 1$. After computing $ESP^*(S, D)$, the final dictionary
represents a rooted ordered binary tree deriving S, which is denoted by ET(S).



Lemma 7 (Cormode and Muthukrishnan [9]) The height of ET(S) is $O(\log |S|)$ and ET(S) can
be computed in $O(|S| \log^* |S|)$ time.



Lemma 8 (Cormode and Muthukrishnan [9]) Let $S = s_1 s_2 \cdots s_k$ be the partition of a Type2
metablock S by alphabet reduction. For any $1 \le j \le |S|$, the block $s_i$ containing S[j] is determined
by at most $S[j - \log^* N - 5 : j + 5]$.

We refer to another characteristic of ESP for the pattern embedding problem. Nodes $v_1, v_2$ in
T = ET(S) are adjacent in this order if the subtrees on $v_1, v_2$ are adjacent in this order. A string
$p_1 \cdots p_k$ of length k is embedded in T if there exist nodes $v_1, \ldots, v_k$ such that $label(v_i) = p_i$ and
any $v_i, v_{i+1}$ are adjacent in this order. If T[i], the i-th leaf of T, is the leftmost leaf of $v_1$ and T[j]
is the rightmost leaf of $v_k$, we say that $p_1 \cdots p_k$ is embedded as T[i : j].

Figure 4: The derivation tree of SLP with $X_1 \to a$, $X_2 \to b$, $X_3 \to X_1 X_2$, $X_4 \to X_1 X_3$,
$X_5 \to X_3 X_4$, $X_6 \to X_4 X_5$, and $X_7 \to X_6 X_5$, representing string $S = val(X_7) = aababaababaab$.

Definition 9 $Q \in (\Sigma \cup V)^*$ is called an evidence of $P \in \Sigma^*$ in S if the following holds: S[i : j] = P
iff Q is embedded as T[i : j].

We note that any P has at least one evidence since P itself is an evidence of P. 

Lemma 10 (Maruyama et al. [29]) Given T = ET(S), for any T[i : i + t] = P, there exists an
evidence $Q = q_1 \cdots q_k$ of P with maximal repetitions $q_i$ and $k = O(\log t)$. We can compute Q
in $O(\log t \log |S|)$ time, and we can also check if Q is embedded as T[j : j + t] in $O(\log t \log |S|)$
time for any j.



2.7 Straight Line Programs 

A straight line program (SLP) [18] is a widely accepted abstract model of outputs of grammar-based
compression methods. An SLP is a sequence of assignments $\{X_i \to expr_i\}_{i=1}^{n}$, where each $X_i$ is
a variable and each $expr_i$ is an expression, where $expr_i = a$ ($a \in \Sigma$), or $expr_i = X_\ell X_r$ ($\ell, r < i$).
Namely, SLPs are admissible grammars in the Chomsky normal form, and hence outputs of
admissible grammars can be easily converted to SLPs in linear time (see also Figure 1). Let
$val(X_i)$ represent the string derived from $X_i$. When it is not confusing, we identify a variable $X_i$
with $val(X_i)$. Then, $|X_i|$ denotes the length of the string $X_i$ derives. An SLP $\{X_i \to expr_i\}_{i=1}^{n}$
represents the string $S = val(X_n)$. The size of an SLP is the number of assignments in it. The height
of variable $X_i$ is denoted $height(X_i)$, and is 1 if $X_i = a$ ($a \in \Sigma$), and $1 + \max\{height(X_\ell), height(X_r)\}$
if $X_i = X_\ell X_r$. The height of an SLP $\{X_i \to expr_i\}_{i=1}^{n}$ is defined to be $height(X_n)$.

Note that $|X_i|$ and $height(X_i)$ for all variables can be calculated in a total of $O(n)$ time by
simple dynamic programming iterations. In the rest of the paper, we will therefore assume that
these values are available.
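
The dynamic programming mentioned above is a single pass over the assignments in index order. The Python sketch below uses a representation assumed for this example: an SLP is a dictionary from the index i to either a terminal symbol or a pair $(\ell, r)$ of smaller indices. The sample SLP is the one deriving aababaababaab.

```python
SLP = {1: "a", 2: "b", 3: (1, 2), 4: (1, 3), 5: (3, 4), 6: (4, 5), 7: (6, 5)}


def lengths_and_heights(slp):
    """Compute |X_i| and height(X_i) for every variable in one bottom-up pass."""
    length, height = {}, {}
    for i in sorted(slp):  # children always have smaller indices than parents
        rhs = slp[i]
        if isinstance(rhs, str):
            length[i], height[i] = 1, 1
        else:
            left, right = rhs
            length[i] = length[left] + length[right]
            height[i] = 1 + max(height[left], height[right])
    return length, height


length, height = lengths_and_heights(SLP)
```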

The following results are known for SLP compressed strings: 

Theorem 11 ([25]) Given two SLPs of total size n that describe strings S and P, respectively, a
succinct representation of Occ(S, P) can be computed in $O(n^3)$ time and $O(n^2)$ space.



Since we can compute $|S|$ and $|P|$ in $O(n)$ time and a membership query to the succinct
representation can be answered in $O(n)$ time [31], the equality checking of whether S = P can be done in
a total of $O(n^3)$ time.

Lemma 12 ([24]) Given an SLP of size n representing string S, an SLP of size $O(n)$ which
represents an arbitrary substring S[i : j] can be constructed in $O(n)$ time.

3 Algorithms for Restructuring Compressed Texts 

In this section we present our polynomial-time algorithms that convert an input compressed
representation to another compressed representation. In the sequel, n and m will denote the sizes of
the input and output compressed representations, respectively.

3.1 Conversions from Run Length Encoding 

For conversions from Run Length Encodings, we obtain the results below. 

Theorem 13 (Run Length Encoding to Re-pair) Given an RL factorization of size n that
represents string S, the grammar of size m produced by applying the Re-pair algorithm to S can be
computed in $O(nm \log m)$ time.

Proof. We consider a simple simulation of the Re-pair algorithm that works on the RL factorization
of the string S. We shall assume that the Re-pair algorithm replaces non-overlapping bigrams with
a new variable in a left-first manner. Let $Y_i \to Y_\ell Y_r$ denote the i-th rule produced by the Re-pair
algorithm running on S. Let $S_1 = S$, and for $i \ge 1$ let $S_{i+1}$ denote the string obtained by replacing
frequent bigrams by $Y_1, Y_2, \ldots$, and $Y_i$. Note that the bigram $Y_\ell Y_r$ will not occur in $S_{i+1}$. Consider
the RL factorization of $S_i$, and let $w_i$ denote the string obtained by concatenating the RL factors
of $S_i$ consisting of characters in $\Sigma \cup \{Y_j\}_{j=1}^{i-1}$.

We find the most frequent bigram $Y_\ell Y_r$ in $w_i$, replace non-overlapping occurrences of
$Y_\ell Y_r$ in $w_i$ with a new variable $Y_i$ on a left-priority basis, and thereby compute $w_{i+1}$.

Let $a, b, c \in \Sigma$ with $a \ne b$ and $a \ne c$, and let $b a^p c$ be a substring of the original string S, where
$p \ge 1$. Consider any occurrence of $b a^p c$ that begins at position v in S, namely, let $S[v : v + p + 1] = b a^p c$.
There are two cases to consider: (1) the range [v : v + p + 1] is fully contained within a variable $Y_k$ in
$w_i$; (2) the range [v : v + p + 1] is contained in a substring of $w_i$ of form $Y_s (Y_{k(1)})^q Y_{k(2)} \cdots Y_{k(l)} Y_t$
with $k(1) > k(2) > \cdots > k(l)$, where $val(Y_{k(1)})^q \cdot val(Y_{k(2)}) \cdots val(Y_{k(l)}) = a^{p'}$ for some $p' \le p$, $b a^x$
is a suffix of $val(Y_s)$, $a^y c$ is a prefix of $val(Y_t)$, and $x + p' + y = p$.

Let $Y_\ell Y_r$ be the most frequent bigram in $w_i$. It is possible to replace non-overlapping occurrences
of $Y_\ell Y_r$ in $O(n)$ time, as follows: We can see that $val(Y_\ell) val(Y_r)$ occurs either (A) in a sequence
$f_j f_{j+1} \cdots f_{j+d-1}$ of $d \ge 2$ consecutive factors in $w_i$ or (B) entirely within a single factor $f_j$ of $w_i$.
This is because, if $val(Y_\ell) val(Y_r)$ contains at least two distinct characters $a \ne b$, then it occurs in
a sequence of d factors, and if $val(Y_\ell) val(Y_r) = a^h$ for some h, then it is fully contained in a factor. Consider
case (A): Let $val(Y_\ell)[1] = c$. Since the number of factors of form $f = c^h$ does not exceed n, the
number of occurrences of the bigram of case (A) is $O(n)$. Now consider case (B): According to the
observation (2) above, any bigram $Y_\ell Y_r$ with $val(Y_\ell) val(Y_r) = a^h$ and $\ell \ne r$ occurs at most once in
each substring of $w_i$ that corresponds to a factor $f_j$. Hence the number of occurrences of such a
bigram in $w_i$ is at most n. If $\ell = r$, then the bigram $Y_\ell Y_r$ can occur $q - 1$ times within a run $(Y_\ell)^q$ of a factor. We
then replace $(Y_\ell)^q$ with $(Y_i)^{q/2}$ if q is even, and with $(Y_i)^{(q-1)/2} Y_\ell$ otherwise, in $O(1)$ time.

Since each $w_i$ consists of characters in $\Sigma \cup \{Y_j\}_{j=1}^{i-1}$, the number of all bigrams in $w_i$ is
$O(m^2 + m|\Sigma|) = O(m^2)$. We find the most frequent bigram in $O(\log m)$ time using a heap, and the total
time complexity for converting the RL factorization to the grammar corresponding to Re-pair is
$O(nm \log m)$. □
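
The constant-time treatment of a run $(Y_\ell)^q$ in the case $\ell = r$ can be illustrated on a run-length encoded sequence. The Python sketch below assumes a representation chosen here, a list of (symbol, exponent) pairs, and covers only this one case of the proof, not the full simulation.

```python
def replace_square_bigram(runs, y_l, y_i):
    """Replace the bigram (y_l, y_l) by y_i in every run, left to right.

    A run (y_l)^q becomes (y_i)^(q/2) if q is even and
    (y_i)^((q-1)/2) y_l otherwise; each run is handled in O(1) time.
    """
    out = []
    for sym, q in runs:
        if sym != y_l:
            out.append((sym, q))
            continue
        if q // 2:
            out.append((y_i, q // 2))
        if q % 2:
            out.append((y_l, 1))
    return out
```

Because the new variable is fresh and neighbouring runs carry different symbols, the result is again a valid run-length encoding without any merging step.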

Theorem 14 (Run Length Encoding to LZ77/LZ78) Given an RL factorization of size n
that represents string S, the LZ factorization of S can be computed in $O(nm + n \log n)$ time, where
m is the size of the LZ factorization.

Proof. Let $a_1^{p_1}, \ldots, a_n^{p_n}$ be the RL factorization of a text S. Assume that we have already
computed the first $i - 1$ LZ77 factors, $f_1, \ldots, f_{i-1}$, of S. Let the pair of integers (u, q) satisfy
$q + p_1 + \cdots + p_{u-1} = |f_1 \cdots f_{i-1}|$, where $1 \le u \le n$ and $1 \le q \le p_u$. For a new factor $f_i$, compute
the lengths $l_j$ of the longest common prefix of $a_{u+1}^{p_{u+1}} \cdots a_n^{p_n}$ and each suffix $a_j^{p_j} \cdots a_n^{p_n}$ ($1 \le j \le u$) of
the RLE, where each RL factor $a^p$ is regarded as a single symbol. The length of the i-th LZ77 factor
is then $\max_j \{p_{u+1} + \cdots + p_{u+l_j} + P + Q\}$, where $P = 0$ if $a_u \ne a_{j-1}$ and $P = \min\{p_u - q, p_{j-1}\}$
otherwise, and $Q = 0$ if $a_{u+l_j+1} \ne a_{j+l_j}$, and $Q = \min\{p_{u+l_j+1}, p_{j+l_j}\}$ otherwise. The process is
then repeated to obtain $f_{i+1}$ from the pair of integers $(u + l_j + 1, p_{u+l_j+1} - Q)$. A naive algorithm
for obtaining each $l_j$ costs $O(n)$ time, and therefore results in an $O(n^2 m)$ time algorithm to check
each of the $O(n)$ suffixes to construct the $O(m)$ factors. If we construct a suffix array and lcp array on
the RLE string beforehand, $l_j$ can be computed in $O(1)$ time, since it amounts to a range minimum
query on the lcp array. Note that the sum $p_{u+1} + \cdots + p_{u+l_j}$ can also be obtained in constant
time with $O(n)$ preprocessing, by constructing an array $sum[i] = p_1 + \cdots + p_i$ and computing
$sum[u + l_j] - sum[u]$. Therefore, conversion can be done in $O(nm)$ time provided that the suffix
array and lcp array are constructed. The construction of the arrays requires $O(n \log n)$ time, to
sort and number each character $a_i^{p_i}$ of the alphabet.

LZ78 factorization can be achieved by a simple modification. □
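
The constant-time range sums used in the proof rest on a standard prefix-sum array. A minimal Python sketch, with 1-based run indices as in the text and function names chosen for this example:

```python
from itertools import accumulate


def build_prefix_sums(exponents):
    """Return the array sum with sum[0] = 0 and sum[i] = p_1 + ... + p_i."""
    return [0] + list(accumulate(exponents))


def run_length_sum(prefix, u, l):
    """Return p_{u+1} + ... + p_{u+l} in constant time."""
    return prefix[u + l] - prefix[u]


prefix = build_prefix_sums([3, 1, 2, 2])  # exponents of a^3 b^1 c^2 a^2
```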

Theorem 15 (Run Length Encoding to Bisection) Given an RL factorization of size n that
represents string S, the grammar of size m produced by applying the Bisection algorithm to S can be
computed in $O(m^2 + (m + n) \log n)$ time.

Proof. Consider the following top-down algorithm which closely follows the description of
Bisection in Section 2.5.2. Assume we want to construct the children $Y_\ell, Y_r$ of variable $Y_s$ representing
S[i : j], to produce the grammar rule $Y_s \to Y_\ell Y_r$. Note that an arbitrary substring S[i : j]
which is contained in the RLE $a_{k-1}^{p_{k-1}} \cdots a_{k+l}^{p_{k+l}}$ can be represented as a 4-tuple (x, k, l, y), where
$i = p_1 + \cdots + p_{k-1} - x + 1$, $j = p_1 + \cdots + p_{k+l-1} + y - 1$, $0 \le x < p_{k-1}$, and $0 < y \le p_{k+l}$. Let k
represent the largest integer where $2^k < j - i + 1$. For the substring S[i : j] under consideration,
the 4-tuples for substrings $S[i : i + 2^k - 1]$ and $S[i + 2^k : j]$ can be obtained in $O(\log n)$ time.
Note that equality checks between substrings represented as 4-tuples can be conducted in $O(1)$
time with $O(n \log n)$ preprocessing, using range minimum queries on the lcp arrays, similar to the
technique used in the conversion to LZ encodings. Equality checks are conducted against the $O(m)$
variables that will be contained in the output. If there exist variables which derive the same string,
the existing variables are used in place of $Y_\ell$ and/or $Y_r$, and $Y_\ell$ and/or $Y_r$ will not be contained
in the output. Since equality checks are conducted only for the children of variables which are
contained in the output, they are conducted only $O(m)$ times. Therefore, conversion can be done
in $O(n \log n + m(m + \log n)) = O(m^2 + (m + n) \log n)$ time. □
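
The $O(\log n)$ step of locating a text position inside the RLE can be realised by binary search over the prefix sums of the exponents. The Python sketch below shows this lookup together with the Bisection split point; it returns a simple (run, offset) pair rather than the 4-tuple of the proof, and all names are assumptions of this example.

```python
from bisect import bisect_left
from itertools import accumulate


def locate(prefix, pos):
    """Return (k, offset): text position pos is the offset-th symbol of run k (both 1-based)."""
    k = bisect_left(prefix, pos)  # first run whose end position is >= pos
    return k, pos - prefix[k - 1]


def bisection_split(i, j):
    """Return the last position of the left child of the variable deriving S[i : j]."""
    k = (j - i).bit_length() - 1  # largest k with 2**k < j - i + 1
    return i + 2 ** k - 1


prefix = [0] + list(accumulate([3, 1, 2, 2]))  # runs a^3 b^1 c^2 a^2
```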



3.2 Conversions from arbitrary SLP 

Theorem 16 (SLP to Run Length Encoding) Given an SLP of size n that represents string
S, the RL factorization of S can be computed in $O(n + m)$ time and $O(n)$ space, where m is the
size of the RL factorization.

Proof. For each variable $X_i$, we first compute the maximal length of the run of identical characters
which is a prefix (resp. suffix) of $X_i$, denoted by $plen(X_i)$ (resp. $slen(X_i)$). This can be computed
in $O(n)$ time by a simple dynamic programming: for $X_i \to X_\ell X_r$, we have $plen(X_i) = plen(X_\ell)$ if
$plen(X_\ell) < |X_\ell|$ or $X_\ell[|X_\ell|] \ne X_r[1]$, and $plen(X_i) = plen(X_\ell) + plen(X_r)$ otherwise. $slen(X_i)$ can
be computed likewise.
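
The recurrence for plen, and its mirror image for slen, can be written as one bottom-up pass. The Python sketch below assumes the SLP is stored as a dictionary from index to either a terminal or a pair of smaller indices, a representation chosen for this example; the sample SLP derives aababaababaab.

```python
def prefix_suffix_runs(slp):
    """Compute plen(X_i) and slen(X_i) for every variable of the SLP."""
    length, first, last, plen, slen = {}, {}, {}, {}, {}
    for i in sorted(slp):  # children have smaller indices than parents
        rhs = slp[i]
        if isinstance(rhs, str):
            length[i], first[i], last[i], plen[i], slen[i] = 1, rhs, rhs, 1, 1
            continue
        l, r = rhs
        length[i] = length[l] + length[r]
        first[i], last[i] = first[l], last[r]
        joined = last[l] == first[r]  # X_l[|X_l|] == X_r[1]
        # The prefix run crosses into X_r only if X_l is a single run.
        plen[i] = plen[l] + plen[r] if plen[l] == length[l] and joined else plen[l]
        # Symmetrically for the suffix run.
        slen[i] = slen[r] + slen[l] if slen[r] == length[r] and joined else slen[r]
    return plen, slen


plen, slen = prefix_suffix_runs(
    {1: "a", 2: "b", 3: (1, 2), 4: (1, 3), 5: (3, 4), 6: (4, 5), 7: (6, 5)}
)
```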

Next, for each variable $X_i \to X_\ell X_r$, let $Llink(X_i)$ denote the variable $X_{i'} \to X_{\ell'} X_{r'}$ such that
$X_{i'}$ is the shallowest descendant of $X_i$ lying on the leftmost path of the derivation tree of $X_i$,
satisfying $slen(X_{i'}) < |X_{r'}|$. $Llink(X_i)$ can also be computed for all $X_i$ in $O(n)$ time, by a simple
dynamic programming. $Rlink(X_i)$ can be defined and computed likewise.

The conversion algorithm is then a top-down post-order traversal on the derivation tree of the SLP
but with jumps using Llink and Rlink. For the root $X_n$, we output (1) $X_n[1]^{plen(X_n)}$, (2) the
RLE of $X_n$ except for the first and last RL factors of $X_n$, and (3) $X_n[|X_n|]^{slen(X_n)}$. (2) can be
computed recursively as follows: at each variable $X_i \to X_\ell X_r$, we output (2.1) the RLE of $Llink(X_i)$
except for the first and last RL factors of $Llink(X_i)$, (2.2) either $X_\ell[|X_\ell|]^{slen(X_\ell)} X_r[1]^{plen(X_r)}$ if
$X_\ell[|X_\ell|] \ne X_r[1]$, or $X_r[1]^{slen(X_\ell) + plen(X_r)}$ if $X_\ell[|X_\ell|] = X_r[1]$, and (2.3) the RLE of $Rlink(X_i)$
except for the first and last RL factors of $Rlink(X_i)$. The theorem follows since the output of each
RL factor is done in constant time. □

Theorem 17 (SLP to LZ77) Given an SLP of size n that represents string S, the LZ77
factorization of size m can be computed in $O(m n^3 \log N)$ time.

Proof. Assume we have already computed $f_1, \ldots, f_{i-1}$ of S from a given SLP of size n. Firstly
we consider the LZ77 factorization without self-references. For a new factor $f_i$, do a binary search
on the length of the factor: create a new SLP of that length, and conduct pattern matching on
the input SLP. If a match exists in the range that corresponds to the previous factors $f_1, \ldots, f_{i-1}$,
i.e., in the prefix $S[1 : \sum_{j=1}^{i-1} |f_j|]$ of S, then the length of $f_i$ can be longer, and if not, it must be
shorter. Using Theorem 11 and Lemma 12, the LZ77 factorization of size m can thus be computed
in $O(m n^3 \log N)$ time. To compute the LZ77 factorization with self-references, we search for the
longest match that begins at a position from 1 to $\sum_{j=1}^{i-1} |f_j|$ in S. The time complexity is the same
as above. □
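
The binary search on the factor length can be illustrated on an uncompressed string. In the Python sketch below, the substring test `in` is only a stand-in for the compressed pattern matching of Theorem 11 and the slicing for Lemma 12; the sketch covers the variant without self-references, and its names are assumptions of this example.

```python
def next_factor_length(s, pos):
    """Return the length of the LZ77 factor (without self-references) starting at 0-based pos."""
    lo, hi = 0, min(pos, len(s) - pos)  # a match inside s[:pos] cannot be longer than pos
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if s[pos:pos + mid] in s[:pos]:  # stand-in for compressed pattern matching
            lo = mid       # a match exists, so the factor can be longer
        else:
            hi = mid - 1   # no match, so the factor must be shorter
    return max(lo, 1)      # a single symbol is always a valid factor


def lz77_by_binary_search(s):
    """Factorize s by repeatedly binary searching the next factor length."""
    factors, pos = [], 0
    while pos < len(s):
        length = next_factor_length(s, pos)
        factors.append(s[pos:pos + length])
        pos += length
    return factors
```

The search is valid because the property "the candidate occurs in the processed prefix" is monotone in the candidate length: every prefix of a matching candidate also matches.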

Theorem 18 (SLP to LZ78) Given an SLP of size n that represents string S, the LZ78
factorization of size m can be computed in $O(n^3 m^2)$ time.

Proof. Our algorithm for converting an SLP to LZ78 follows a similar idea: When computing a new
factor $f_i$, we construct a new SLP of $f_k f_{k+1}[1]$ for each $1 \le k < i$, and run the pattern matching
algorithm on the input SLP. The longest match in the suffix $S[\sum_{j=1}^{i-1} |f_j| + 1 : |S|]$ provides the
new factor $f_i$. By Theorem 11 and Lemma 12, the pattern matching tasks for computing each factor
$f_i$ take $O(n^3 m)$ time, and therefore the total time complexity is $O(n^3 m^2)$. □



3.3 SLP to Bisection 

Theorem 19 Given an SLP of size n that represents string S, the grammar of size m produced by
applying the Bisection algorithm to S can be computed in $O(n^3 m^2)$ time.

Proof. Given an arbitrary SLP of size n representing S, consider the following top-down algorithm
which closely follows the description of Bisection in Section 2.5.2. Assume we want to construct
the children $Y_\ell, Y_r$ of variable $Y_s$ representing S[i : j], to produce the grammar rule $Y_s \to Y_\ell Y_r$. Let
k represent the largest integer where $2^k < j - i + 1$. By using Lemma 12, SLPs $Y_\ell$ representing
$S[i : i + 2^k - 1]$ and $Y_r$ representing $S[i + 2^k : j]$ can be constructed in $O(n)$ time. For these SLPs,
equality checks are conducted against all $O(m)$ variables corresponding to variables that will be
contained in the output produced so far. If there exist variables which derive the same string, the
existing variables are used in place of $Y_\ell$ and/or $Y_r$, and $Y_\ell$ and/or $Y_r$ will not be contained in the
output. From Theorem 11, the equality checks for $Y_\ell$ and $Y_r$ can be conducted in a total of $O(n^3 m)$
time. Since equality checks are conducted only for the children of variables which are contained in
the output, the total time is $O(n^3 m^2)$. □

3.4 Conversions to and from ESP 

Given a representation of SLP G for a string S, we design algorithms to compute LZ77 and LZ78
factorizations for S without explicit decompression of G in $O((n + m) \log^d N)$ time. Here n and m are
the sizes of the input and output grammars, respectively, $N = |S|$, and d is a constant. Our method is based on the
transformation of any SLP to its canonical form by way of an equivalent ESP.

Lemma 20 Given a dictionary D from $ESP^*(S, D)$ for some $S \in \Sigma^*$, and the set V of variables
in D, we can compute an SLP with the dictionary $D'$ and the set $V'$ of variables which satisfies the
following conditions: (1) $|D'| \le 2|D|$ and (2) for any $X_i, X_j \in V'$, $val(X_i) <_{lex} val(X_j)$ iff $i < j$,
where $<_{lex}$ denotes the lexical order over $\Sigma$. The computation time is $O(n \log n \log^3 N)$ for $|V| = n$
and $|S| = N$.

Proof. Consider $T_X = ET(val(X))$ and $T_Y = ET(val(Y))$ for any $X, Y \in V$. Let
$t = \lceil |val(Y)|/2 \rceil$. By Lemma 10, we can compute an evidence Q of the pattern $T_Y[1 : t]$ in $O(\log^2 t) = O(\log^2 N)$
time. We can also check if Q is embedded as $T_X[1 : t]$ in $O(\log^2 N)$ time. By this binary
search, we can find the length of the longest common prefix of val(X) and val(Y) in $O(\log^3 N)$ time.
Thus, we can sort all variables in V in $O(n \log n \log^3 N)$ time. After sorting all variables in V, we
rename any variable according to its rank. If there is a variable X with $X \to X_i X_j X_k$, we divide
it into $X \to Y X_k$ and $Y \to X_i X_j$ by an intermediate variable Y and we can determine the rank of
such new variables in additional $O(n \log n)$ time. □
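
The last step of the proof, dividing each ternary rule with one intermediate variable, is what yields the bound $|D'| \le 2|D|$. The Python sketch below shows only this division, not the sorting or renaming; the dictionary representation and the primed name of the intermediate variable are hypothetical choices for this example.

```python
def binarize(rules):
    """Split every rule X -> Xi Xj Xk into X -> Y Xk and Y -> Xi Xj."""
    out = {}
    for var, rhs in rules.items():
        if len(rhs) == 3:
            helper = var + "'"  # hypothetical name for the intermediate variable Y
            out[helper] = rhs[:2]
            out[var] = (helper, rhs[2])
        else:
            out[var] = rhs
    return out


binary_rules = binarize({"X": ("A", "B", "C"), "Z": ("A", "B")})
```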

Dictionaries $D_1, D_2$ of two admissible grammars are called consistent if $X \to \alpha,\ Y \to \alpha \in D_1 \cup D_2$ 
implies $X = Y$; consistent dictionaries $D_1, \ldots, D_k$ are defined similarly. 
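
As a small illustration of this definition, the following Python sketch checks consistency of any number of dictionaries. The representation of a dictionary as a mapping from a variable name to its right-hand side is an assumption of the sketch, not something fixed by the paper.

```python
def is_consistent(*dictionaries):
    """Return True if no right-hand side is produced by two different variables."""
    owner = {}                        # right-hand side -> variable that produces it
    for d in dictionaries:
        for var, rhs in d.items():
            # setdefault records the first owner and returns the recorded one
            if owner.setdefault(tuple(rhs), var) != var:
                return False
    return True
```

For instance, the dictionaries `{"A": ("a", "b")}` and `{"B": ("a", "b")}` are not consistent, because the same right-hand side is derived from two different variables.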

For $\alpha \in (\Sigma \cup V)^*$, $\alpha = q_1 \cdots q_k$ is called a run-length representation of $\alpha$ if each $q_i$ is a maximal 
repetition of a symbol $p_i \in \Sigma \cup V$. For example, the run-length representation of $abbaaacaa$ is $q_1 q_2 q_3 q_4 q_5 = 
a b^2 a^3 c a^2$. The number $k$ of $\alpha = q_1 \cdots q_k$ is called the change of $\alpha$. 
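
The run-length representation and the change of a string can be computed directly, as in the following minimal Python sketch. The function names are chosen for this illustration, and the input is taken to be a plain sequence of symbols.

```python
from itertools import groupby


def run_length(symbols):
    """Return the maximal repetitions as (symbol, exponent) pairs."""
    return [(sym, len(list(group))) for sym, group in groupby(symbols)]


def change(symbols):
    """Return the number k of maximal repetitions q_1 ... q_k."""
    return len(run_length(symbols))
```

Applied to the example above, `run_length("abbaaacaa")` yields the five pairs for $a$, $b^2$, $a^3$, $c$, $a^2$, so the change of the string is 5.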

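Let $S = \alpha\beta\gamma$ and $S' = \alpha'\beta'\gamma'$ satisfy $ESP(S, D) = (S', D \cup D')$ with $\alpha'(D') = \alpha$, $\beta'(D') = \beta$, 
and $\gamma'(D') = \gamma$. Then we call such $S = \alpha\beta\gamma$ a stable decomposition of $S$. An expression 
$ESP(\alpha[\beta]\gamma, D) = (\alpha'[\beta']\gamma', D \cup D')$ denotes an ESP that replaces $\alpha$, $\beta$, $\gamma$ by $\alpha'$, $\beta'$, $\gamma'$, 
respectively. For a string $\alpha$, $\overline{\alpha}$ and $\underline{\alpha}$ denote a prefix of $\alpha$ and a suffix of $\alpha$, respectively. 

Lemma 21 Let $ESP(\alpha[\beta]\gamma, D) = (\alpha'[\beta']\gamma', D \cup D')$ for a stable decomposition $S = \alpha\beta\gamma$. There 
exist substrings $\underline{\alpha}$, $\underline{\alpha\beta}$, $\overline{\beta\gamma}$, $\overline{\gamma}$, each of whose change is at most $\log^* |S| + 5$, such that 

$ESP([\alpha]\overline{\beta\gamma}, D) = ([\alpha']y_1, D \cup D_1)$, 

$ESP(\underline{\alpha}[\beta]\overline{\gamma}, D) = (x_2[\beta']y_2, D \cup D_2)$, 

$ESP(\underline{\alpha\beta}[\gamma], D) = (x_3[\gamma'], D \cup D_3)$, and 

$D' = D_1 \cup D_2 \cup D_3$. 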

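Proof. Since $S = \alpha\beta\gamma$ is a stable decomposition of an ESP for $S$, the translated string $\alpha'$ and 
the dictionary $D_1$ with $\alpha'(D_1) = \alpha$ are determined by only $\alpha$ and a prefix $\overline{\beta\gamma}$ of $\beta\gamma$. In case $p^+$ is the 
maximal prefix of $\beta\gamma$, we can set $\overline{\beta\gamma} = p^+$. Otherwise, by Lemma [H we can set $\overline{\beta\gamma}$ to be a prefix 
of length at most $\log^* |S| + 5$. For $\beta$ and $\gamma$, we can set the suffix and prefix pair $\underline{\alpha}$, $\overline{\gamma}$ and the suffix $\underline{\alpha\beta}$ with bounded change, respectively. 
The above ESP defines $\alpha'(D_1) = \alpha$, $\beta'(D_2) = \beta$, and $\gamma'(D_3) = \gamma$. By renaming all variables in the 
dictionaries, there is a consistent $D' = D_1 \cup D_2 \cup D_3$ satisfying $\alpha'(D')\beta'(D')\gamma'(D') = \alpha\beta\gamma$. □ 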

Lemma 22 Let $D$ be a dictionary of an SLP encoding a string $S \in \Sigma^*$. A dictionary $D'$ of an ESP 
equivalent to $D$ is computable in $O(n \log N + m)$ time, where $n = |D|$, $m = |D'|$, and $N = |S|$. 

Proof. We assume that $ESP^*(val(X_\ell), D)$ ($\ell \le i, j$) is already computed and let $D'$ be the current 
dictionary consistent with all $val(X_\ell)$. For $X_k \to X_i X_j$ ($k > i, j$), we estimate the time to update 
$D'$. Let $val(X_i) = \alpha$ and $val(X_j) = \gamma$. 

For the initial strings $\alpha$ and $\gamma$, we can obtain a suffix $\underline{\alpha}$ of $\alpha$ of length $\log^* N + 6$ and a prefix $\overline{\gamma}$ of $\gamma$ of length 6 in $O(\log N \log^* N)$ 
time. By the result of $ESP(\underline{\alpha}\,\overline{\gamma}, D')$, we determine the position block $\beta$ to which $\alpha[|\alpha|]$ and $\gamma[1]$ belong. 
Then we can find a stable decomposition $S = \alpha\beta\gamma$ for the obtained $\beta$ and the reformed $\alpha$ and $\gamma$, 
where $\alpha$ (and $\gamma$) is represented by a path from the root to a leaf in the derivation tree of $D_\alpha$ (and 
$D_\gamma$). They are called the current tail and head, respectively. Note that we can avoid decoding $\alpha$ and $\gamma$ 
for the parsing in Lemma 21. To simulate this, we use only the compressed representations $D_\alpha$, $D_\gamma$, 
the current tail/head, and $\beta$. Using the run-length representation, the change of $\beta$ is bounded by 
$O(\log^* N)$ as follows. 

By Lemma 21, when $ESP(\alpha[\beta]\gamma, D) = (\alpha'[\beta']\gamma', D \cup D')$ is computed by $ESP([\alpha]\overline{\beta\gamma}, D) = 
([\alpha']y_1, D \cup D_1)$, $ESP(\underline{\alpha}[\beta]\overline{\gamma}, D) = (x_2[\beta']y_2, D \cup D_2)$, and $ESP(\underline{\alpha\beta}[\gamma], D) = (x_3[\gamma'], D \cup D_3)$, the 
resulting string $\beta'$ is treated as the next $\beta$, and the current tail and head are replaced by $\alpha'[|\alpha'|]$ 
and $\gamma'[1]$, which represent the next $\alpha$ and $\gamma$. 

Let us consider the case $ESP([\alpha]\overline{\beta\gamma}, D) = ([\alpha']y_1, D \cup D_1)$, where $\overline{\beta\gamma}$ is a prefix of $\beta\gamma$. If $\alpha\beta\gamma$ contains a maximal repetition 
of $p$ as $\alpha[|\alpha| - N_1, |\alpha|] \cdot \beta\gamma[1, N_2] = p^+$, the next tail is the parent of $\alpha[|\alpha| - N_1 - 1]$, which is determined 
in $O(\log N)$ time since any repetition is replaced by the left aligned parsing and $N_1 + N_2 = O(N)$. 
Otherwise, by Lemma [HI we can determine the next tail by tracing a suffix of $\alpha$ of length at most 
$\log^* N + 5$ in $O(\log N \log^* N)$ time. 

Thus, $ESP(\alpha[\beta]\gamma, D) = (\alpha'[\beta']\gamma', D \cup D')$ is simulated in O(log^ N + m_k) time for $X_k \to X_i X_j$, 
where $m_k$ is the number of new variables produced in this ESP. Therefore we conclude that the 
final dictionary $D'$ equivalent to $D$ is obtained in O((log N + m_1) + ... + (log N + m_n)) = 
O(n log^ N + m) time. □ 

Theorem 23 (SLP to Canonical SLP) Given an SLP $D$ of size $n$ for a string $S$ of length $N$, we 
can construct another SLP $D'$ of size $m$ in $O(n \log N + m \log m \log N)$ time such that $D'$ is a final 
dictionary of $ESP^*(S, D')$ equivalent to $D$ and all variables in $D'$ are sorted by the lexical order 
of their encoded strings. 

Theorem 24 (Canonical SLP to LZ77) Given a canonical SLP $D$ of size $n$ for a string $S$ of 
length $N$, we can compute the LZ77 factorization $f_1, \ldots, f_m$ of $S$ in $O(m \log n \log N + n \log n)$ time. 

Proof. Using the technique in Lemma 20 we can sort all variables $Z$ associated with $Z \to XY \in D$ 
by the following two keys: the first key is the lexical order of $val(X)^R$ and the second is the lexical 
order of $val(Y)$, where $S^R$ denotes the reverse string of $S \in \Sigma^*$. Then $Z$ is mapped to a point 
$(i, j, pos)$ in 3-dimensional space such that $i$ is the index of the first key on the X-axis, $j$ is the index 
of the second key on the Y-axis, and $pos$ is the index of the leftmost occurrence of $val(Z)$ on the Z-axis. A data 
structure supporting range queries for the point set is constructed in $O(n \log n)$ time/space and 
achieves O(log^ n) query time (see [26]). Using this, we can compute $f_{\ell+1}$ from $f_1, \ldots, f_\ell$ and the 
remaining suffix $S'$ such that $S = f_1 \cdots f_\ell \cdot S'$ as follows. 

By Lemma 10, an evidence $Q = q_1 \cdots q_k$ of $S'[1 : j]$ satisfying $k = O(\log j)$ is found in O(log^ j) 
time. Let $q_i$ be a symbol. Then we guess the division $\alpha = q_1 \cdots q_i$ and $\beta = q_{i+1} \cdots q_k$ to find the 
range of $X$ in which $\alpha$ is embedded as a suffix, the range of $Y$ in which $\beta$ is embedded as a 
prefix, and the range of $Z$ whose leftmost position $pos$ satisfies $pos + j < |f_1 \cdots f_\ell|$. This query time 
is $O(\log n \log N)$. Next, let $q_i$ be a repetition of a symbol $p_i$. Any maximal repetition is replaced by the left 
aligned parsing, and a resulting new repetition is recursively replaced in the same manner. Thus, 
an embedding of $q_1 \cdots q_k$ into $Z \to XY$ that divides $q_i = \alpha\beta$ such that $q_1 \cdots q_{i-1}\alpha$ is embedded in $X$ as a 
suffix and $\beta q_{i+1} \cdots q_k$ is embedded in $Y$ as a prefix is possible in $O(\log j) = O(\log N)$ divisions for $q_i$. 
In this case, the query time is $O(\log n \log N)$. Therefore, the total time to compute the required 
LZ77 factorization is bounded by $O(m \log n \log N + n \log n)$. □ 
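
To make the target of Theorem 24 concrete, the following Python sketch computes a factorization directly on the uncompressed string. It assumes the variant of LZ77 without self-references, in which each factor is the longest prefix of the remaining suffix that occurs entirely inside the already factorized prefix, or a single character if there is no such occurrence; the definition used in the paper is given in an earlier section and may differ in detail. The sketch is a naive reference implementation for checking outputs on small inputs, not the compressed algorithm of the theorem.

```python
def lz77_factorize(s):
    """Naive LZ77 factorization without self-references (assumed variant)."""
    factors = []
    i = 0
    n = len(s)
    while i < n:
        length = 0
        # Extend the factor while s[i:i+length+1] occurs entirely within s[:i].
        while i + length < n and s.find(s[i:i + length + 1], 0, i) != -1:
            length += 1
        length = max(length, 1)       # a fresh character forms its own factor
        factors.append(s[i:i + length])
        i += length
    return factors
```

The greedy extension is sufficient because every prefix of a substring that occurs in `s[:i]` also occurs there.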

Theorem 25 (Canonical SLP to LZ78) Given a canonical SLP $D$ of size $n$ for a string $S$ of 
length $N$, we can compute the LZ78 factorization $f_1, \ldots, f_m$ of $S$ in $O(m \log N + n)$ time. 

Proof. Assume that the first $\ell$ factors $f_1, \ldots, f_\ell$ are obtained. By Lemma 10 we can find an 
evidence $Q_i$ of $f_i$ ($1 \le i \le \ell$), and all evidences $Q_i$ ($1 \le i \le \ell$) are represented by a trie. Let $S'$ be the 
remaining suffix of $S$. For each $j$, we can compute an evidence of $S'[1 : j]$ in $O(\log j) = O(\log N)$ 
time. Thus, we can find the greatest $j$ satisfying $f_i = S'[1 : j]$ for some $1 \le i \le \ell$ in $O(\log N)$ 
time using binary search. Therefore, the total time to compute the required LZ78 factorization is 
bounded by $O(m \log N + n)$. □ 
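
For comparison, a standard LZ78 factorization of an uncompressed string can be sketched in Python with an explicit trie of the previous factors. This is a textbook-style reference implementation under the usual definition (each factor is the longest previous factor that prefixes the remaining suffix, extended by one character); it mirrors the role of the trie in the proof but works on characters rather than on evidences.

```python
def lz78_factorize(s):
    """Naive LZ78 factorization using a trie of previously produced factors."""
    trie = {}                         # nested dicts; every node is a previous factor
    factors = []
    i = 0
    while i < len(s):
        node = trie
        j = i
        # Follow the longest previous factor that is a prefix of s[i:].
        while j < len(s) and s[j] in node:
            node = node[s[j]]
            j += 1
        if j < len(s):
            node[s[j]] = {}           # new factor: matched factor plus one character
            j += 1
        factors.append(s[i:j])
        i = j
    return factors
```

If the input ends in the middle of a match, the last factor repeats an earlier one; all other factors are pairwise distinct.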

Theorem 26 (Run Length Encoding to ESP) Given a text $S$ represented as an RL encoding 
$S = f_1 \cdots f_n$ of length $n$, we can compute an ESP $D$ representing $S$ in $O(n \log^* N)$ time. 

Proof. We slightly modify the replacement of maximal repetitions. Suppose a maximal repetition 
$\alpha = a^k$ appears in $S$. If $k$ is even, we replace $\alpha$ by $A^{k/2}$; otherwise we replace it by $A^{(k-3)/2}B$, 
where $A \to aa$ and $B \to aaa$. In the exceptional case where a prefix and/or suffix of $\alpha$ is replaced 
together with the symbol adjacent to $\alpha$ on the left/right, we must instead consider the string $\alpha'$ obtained by removing that prefix/suffix from 
$\alpha$. The time to replace such a repetition is $O(1)$ since the number of repeated symbols 
is represented as an integer. Therefore, the conversion time is bounded by $O(n \log^* N)$. □ 
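
The constant-time replacement of a run can be written down as a short Python sketch. It returns only the exponents of $A$ and $B$, which is all that is needed when the run itself is stored as an integer; the function name and the returned pair are assumptions of this illustration, and the exceptional boundary case mentioned in the proof is not handled.

```python
def replace_run(k):
    """For a run a^k with k >= 2, return (x, y) such that a^k becomes A^x B^y,
    where A -> aa and B -> aaa."""
    if k < 2:
        raise ValueError("a repetition needs at least two symbols")
    if k % 2 == 0:
        return (k // 2, 0)            # even run: only pairs
    return ((k - 3) // 2, 1)          # odd run: pairs followed by one triple
```

In both cases the lengths add up, since $2x + 3y = k$.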

Theorem 27 (ESP to Bisection) Given an ESP $D$ of size $n$ representing $S$ of length $N$, we can 
compute an SLP of size $m$ generated by Bisection in $O(m \log N)$ time. 

Proof. For each substring $S[i : j]$ under consideration, we can obtain an evidence 
$Q$ corresponding to $S[i : i + 2^k - 1]$ in $O(\log N)$ time. By Lemma 10 we can check whether $Q$ is 
embedded as $T[i + 2^k : j]$ in $O(\log N)$ time. If $Q$ can be embedded, we can allocate the same variable 
to $S[i : i + 2^k - 1]$ and $S[i + 2^k : j]$ since both substrings are equal; otherwise different variables 
are allocated. Therefore, the conversion can be done in $O(m \log N)$ time. □ 
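
The grammar that Theorem 27 targets can be illustrated by running Bisection directly on an uncompressed string. The Python sketch below assumes the usual formulation, in which a string of length greater than one is split after its first $2^k$ symbols, where $2^k$ is the largest power of two strictly smaller than its length, and equal substrings share one variable; the paper's own definition appears in an earlier section. Variable names of the form `X1, X2, ...` are hypothetical.

```python
def bisection(s):
    """Naive Bisection grammar: returns (start variable, rules)."""
    if not s:
        raise ValueError("the input string must be non-empty")
    rules = {}                        # variable -> terminal or (left, right)
    names = {}                        # substring -> variable (equal strings share it)

    def build(sub):
        if sub not in names:
            if len(sub) == 1:
                rhs = sub
            else:
                # largest power of two strictly smaller than len(sub)
                half = 1 << ((len(sub) - 1).bit_length() - 1)
                rhs = (build(sub[:half]), build(sub[half:]))
            names[sub] = f"X{len(names) + 1}"
            rules[names[sub]] = rhs
        return names[sub]

    return build(s), rules
```

The memoization by substring plays the role of the equality test in the proof, which is carried out there on evidences instead of on explicit strings.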






4 Conclusions and Future Work 

In this paper we presented new efficient algorithms which, without explicit decompression, convert 
to/from compressed strings represented in terms of run length encoding (RLE), LZ77 and LZ78 
encodings, the grammar based compressors RE-PAIR and BISECTION, edit sensitive parsing (ESP), 
straight line programs (SLPs), and admissible grammars. All the proposed algorithms run in 
polynomial time in the input and output sizes, while algorithms that first decompress the input 
compressed string can take exponential time. Examples of applications of our results are dynamic 
compressed strings allowing for edit operations, and post-selection of specific compression formats. 
Future work is to extend our results to other text compression schemes, such as Sequitur [34], 
Longest-First Substitution [32], and Greedy [1]. 